-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue-630: Changes to define the "CASEI" function for supporting case-insensitive string comparisons. #641
Conversation
…-insensitive string comparisons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a few topics for discussion.
Regarding the questions:
- The name CASEI works for me.
- Good question whether we want to support ACCENTI, too. Probably yes. If we add it, I think it should be in its own conformance class. Some backends may not support accent collation.
Meeting 2021-10-25: See the review comments for their resolution. Regarding the open questions:
|
I have a concern about the semantics of and complexity of implementing CASEI. My main concern is that the use of CASEI in an expression has a non-local effect on the evaluation of the expression, such that it's not actually a function. With any user-defined function, the function evaluates to a value. However, CASEI doesn't evaluate to a single value, but rather modifies the semantics of the operators in the expression it's used in. One of the examples is:
but this has the same semantics as:
since The presence of CASEI changes the semantics of the operators, rather than the value of the argument to the function. I feel like this is going to make it very difficult for implementers to get right, much more so than just having UPPER and LOWER functions. This gets complicated, especially when multiple functions are involved. For example, I'm not sure what the semantics of this expression would be: FOO(BAR(CASEI(road_class))) IN ('Οδος','Straße') The CASEI should do something, but it seems odd that FOO(BAR(CASEI(road_class))) and CASEI(FOO(BAR(road_class))) would have the same semantics. |
@philvarner where did you get that CASEI() has a non-local effect? That is not the intent and CASEI() has no effect on the semantics of the operator. The intent of |
Ah, so CASEI returns a new type that is this insensitive format, and then that can only be compared against other values of that same type. So something like |
One place in the spec that made me think it was a non-local effect is "Although the |
@philvarner |
Yes, I think describing case folding would be good, since I've been at this a long time I've never used it, and instead just always done lowercasing, which is probably a fault of my focus primarily on English. Glad I know about it now! The way I would describe it is that it returns a string typed representation of the input string that is guaranteed to be equal to any other case insensitive representation of that string. The Unicode specs I read say this is typically lowercase (in contrast to your example of it being uppercase 'STRASSE'), but I would stress to users that they should ONLY compare the results of two CASEI function expressions if they actually want to ensure a correct comparison, because the exact rules of conversion are opaque to them. So the only "durable" comparison is:
|
Co-authored-by: Phil Varner <[email protected]>
06-DEC-2021: There is an open question about whether CASEI() and ACCENTI() should be in their own conformance classes and we need feedback on that. @cportele and @pvretano feel that English implementations probably only care about CASEI() while European implementors would want ACCENTI() too. So based on that, separate conformance classes would make sense. If anyone has any feeling about that, please add your comments here and the SWG can circle back in the a future meeting. |
I'm usually an advocate finer-grained conformance classes, but I don't see a strong case here for separating them. English-only speakers may only see the use case for needing case-insensitive comparison, but the implementations they make will likely also have European language users and data who need the accent-insensitive comparison. I don't know of any databases that only have support for case and not accent. There's always the fall-back of simply implementing CASEI or ACCENTI as a custom function. |
From a German POV, I find the differentiation really weird. ß is supported in CASEI, but ä, ö and ü are not, which is surprising as from a German POV they are basically equivalent. The naming is also not ideal. In German (might be biased), I usually think about the French accents like á and à etc and not about ä and ö if I speak about accents. We call these umlauts and I doubt people will really make the connection. Overall, it feels like an implementation detail is shifted towards the user. I'd strongly step back from the English centric standpoint and think more global here. Aren't most environments nowadays supporting Unicode related functions? And if a back-end really can't support it and thinks he only delivers in English, then he can still simply not support it in CASEI and will not run into problems. So my +1 to merging ACCENTI and CASEI. |
I'm coming from a (non-developer) user point of view here. Why does a novice user need to understand the difference between CASEI and ACCENTI? As a normal user I just want to say that I don't care about casing and all the details should be handled by the implementation. If I let the users think about it, I make a complexity of the implementation (to support accents etc) be a complexity for the user... But maybe I'm oversimplifying things? |
@m-mohr for a user, this would definitely be nicer. |
I think so. Normalization of characters is orthogonal to case folding. If we merge the two, this would create issues for the cases where one wants one or the other, but not both. Do databases usually offer a convenience function that combines both? Even if not, should we provide one? |
In Swedish, ÅÄÖ are not just A and O with accents but their own separate letters. The words "Al" and "Ål" have completely different meaning, so it would make sense to me to want to want to search in a case insensitive way while still keeping the accents, so that I don't get a hit on "Ål" when searching for "al". |
Wait, I was only talking about casing, not about allowing ä for a etc. I wasn't even aware that the latter is tackled by those functions and it's indeed completely different thing and should likely be a separate function, but casing is casing regardless of what I pass in (I think). Maybe I misunderstood what the functions do. I'd expect that with CASEI:
What I could imagine is that ACCENTI (or whatever it's then called) allows:
That seperates two different use cases clearly and makes it easy to use (which might not be easy to implement). I don't have a strong preference whether it's one or two conformance classes. |
@cportele et all, FYI, most databases that I surveyed before creating this PR handle case insensitivity separately from accent insensitivity and many databases have a convention where letters such as '_AI' and '_CI' are appended to the collation name to indicate whether the collation is case insensitive, accent insensitive or both. The default collation for MySQL, for example, is 'latin1_swedish_ci' so it is a case insensitive -- but not accent insensitive -- collation. |
@m-mohr Sorry, my bad. I read "+1 to merging ACCENTI and CASEI" as merging the two functions, not the conformance classes. |
I'm agnostic about whether CASEI and ACCENTI are in one conformance class or two. I can see both sides. On the one hand these are related functions and so existing in one conformance class makes sense. One the other hand, languages have a mix of case and accent requirements and that sorta points to separate conformance classes ... although it seems kinda short sighted to develop software for a specific language or set of languages ... no!? ;) I'm happy to go with the consensus. |
To clarify, I think they must be two distinct functions, but I think it's okay to put them in the same conformance class, but I also don't have a problem with putting them in different conformance classes because of the uncertainty about whether users can or want to support both of them. |
One other comment to add about what the purpose of these two functions is. Both are useful to operate across data that has not been normalized or has been normalized to values that are different than they should be. CASEI is useful when I have different data that sets a field value to "PLANET", "Planet", or "planet" and I want to match either without having to enumerate all the variations. ACCENTI is useful when either: |
[[collation-def]] | ||
==== collation | ||
a set of rules that indicate how to compare and sort character string data | ||
the process of ordering units of textual information https://www.unicode.org/glossary/#collation[Glossary of Unicode Terms] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The term “collation” does not seem to be used in the remainder of the document. Should it be removed from the terms & definitions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meeting 2022-01-17: Use the term in the description and keep the definition.
cql2/standard/requirements/accent-case-insensitive-comparison/REQ_accenti-builtin-function.adoc
Outdated
Show resolved
Hide resolved
^|A |The server SHALL support a built-in function named `ACCENTI`. | ||
^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`). | ||
^|C |The function SHALL return a character string literal. | ||
^|D |The function SHALL implement https://www.w3.org/TR/charmod-norm/#unicodeNormalization[unicode normalization] described in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.6 Normalization]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding the term “Unicode normalization” to the terms & definitions.
Unicode normalization
normalization
process of removing alternate representations of equivalent sequences from textual data, to convert the data into a form that can be binary-compared for equivalence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meeting 2022-01-17: Agreed.
cql2/standard/requirements/accent-case-insensitive-comparison/REQ_casei-builtin-function.adoc
Outdated
Show resolved
Hide resolved
^|A |The server SHALL support a built-in function named `CASEI`. | ||
^|B |The function SHALL accept one argument that can be a character string literal, the name of a property that evaluates to a string literal or a function that returns a string literal (see rules `characterLiteral`, `propertyName`, `function`). | ||
^|C |The function SHALL return a character string literal. | ||
^|D |The function SHALL implement the https://www.w3.org/TR/charmod-norm/#definitionCaseFolding[full case folding] algorithm defined in the implementation guidelines of https://www.unicode.org/versions/Unicode14.0.0[The Unicode Standard, Version 14.0] (see https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf[clause 5.18 Case Mappings, sub-clause Caseless Matching]). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider adding the term “Unicode case folding” to the terms & definitions:
Unicode case folding
case folding
process of making two texts which differ only in case identical for comparison purposes
Note: Case folding is meant for the purpose of string matching.
Source: https://www.w3.org/TR/charmod-norm/#definitionCaseFolding
Consider adding and using the term “Unicode case-insensitive matching” as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meeting 2022-01-17: Agreed.
…REQ_accenti-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <[email protected]>
…REQ_casei-builtin-function.adoc Meeting 2022-01-17: Agreed. Co-authored-by: Heidi Vanparys <[email protected]>
Meeting 2022-01-17: Define both functions in one requirements/conformance class. In most cases, implementing the functions will simply require calling the function in the underlying datastore, but if someone wants/needs to implement the capability themselves, this is a complex implementation. If it turns out that it would be better to have separate conformance declarations, we can add two additional conformance classes, one for each function. @pvretano will resolve the outstanding comments/edits and then merge the PR. |
==== | ||
|
||
---- | ||
etat_vol = ACCENTI('débárquér') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pvretano I know you just merged this, but shouldn't this be
ACCENTI(etat_vol) = ACCENTI('débárquér')
as as it is for CASEI? same for the JSON example.
@philvarner - I agree. There are also a few other comments/edits that were discussed in this PR that are not yet included. I am working on a follow-up PR. I will make that change there, too. cc @pvretano |
@philvarner yes, it should be |
@pvretano - I have already fixed it in my edits (the property references in the JSON also needed an update). |
There were still some outstanding comments/edits from the discussion in PR #641: * Add explanatory text (#641 (comment)) * Removed definition of "collation" - it was not used in the text and it is not necessary to introduce the term (#641 (comment)) * Add "unicode normalization" as a term (#641 (comment)) * Add "unicode case folding" as a term (#641 (comment)) * Correct examples (#641 (comment))
follow-up edits from PR #641 Meeting 2022-01-31: Merge the PR. Any additional comment would be addressed in a new PR.
Still in progress pending discussion in the SWG. Need to find out:
NOTE: In the processed of adding the CASEI function I noticed that the way we had defined the UPPER/LOWER functions in the JSON was not correct. UPPER/LOWER and CASEI are suppose to be built-in functions and so they should be encoded as a function would be but that was not the case. So, I fixed that.